Adding hygiene to Gambit Scheme
The Scheme programming language is known for
its powerful macro system.
With Scheme source code represented as ordinary Scheme data,
macro transformations allow the programmer to act directly on the
underlying abstract syntax tree.
Macro transformations use a syntax similar to that of
regular procedures, but they define procedures
meant to be executed at compile time.
Those procedures return an abstract syntax tree representation
to be substituted at the transformer's call site.
Procedures executed at compile time enjoy the same
expressive power as run-time procedures.
With the macro system,
the programmer can create specialized
syntax rules without additional performance costs.
This also allows for code abstractions
without the usual run-time cost of closure creation.
Scheme's representation of source code as data
is inherited from the Lisp programming language.
Source code is represented as lists of symbols, or lists
containing other lists: a structure known as S-expressions.
However, with this simplistic approach,
accidental name clashes can occur.
The binding to which an identifier refers
is determined by the lexical context of that identifier.
When an identifier is moved within the abstract syntax tree,
it can be captured by another binding of the same name in its new lexical context.
This can cause unexpected behavior,
as the mere choice of names can substantially change
a program's meaning.
Accidental name clashes can be manually fixed in the code,
using name obfuscation, for instance.
However, the programmer becomes responsible
for the program's safety.
The automatic preservation of referential transparency
is called hygiene and has been
thoroughly studied in the context
of the lambda calculus and Lisp-like languages.
The latest Scheme revised report, used as the specification for the
language, extends it with hygienic macro
transformations.
Up to this point, the Gambit Scheme implementation
did not provide a built-in hygienic macro system.
As a contribution, we re-implemented Gambit's
macro system to support hygienic transformations
at its core.
The algorithm we chose is
based on the set of scopes algorithm, implemented in the
Racket language by Matthew Flatt.
The Racket language is heavily based on Scheme but
diverges on some core features.
One key aspect of Racket is
its extensive hygienic macro system, on
which most of its core features are built;
the algorithm has thus been robustly tested in that context.
In this thesis, we will give an overview of the Scheme language
and its syntax. We will state the hygiene problem and describe
different strategies used to enforce hygiene automatically.
Our algorithmic
choice is then justified and formalized. Finally, we
present the original Gambit macro system and explain
the changes required. We also provide a validity and performance
analysis, comparing the original Gambit implementation to
our new system.
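The name-capture problem the thesis addresses can be sketched with the classic swap! macro (an illustrative example, not taken from the thesis), written with R7RS syntax-rules:

```scheme
;; A hygienic swap! macro. With naive textual expansion,
;; expanding (swap! tmp x) would capture the user's `tmp`;
;; a hygienic expander renames the macro's own `tmp` so the
;; two identifiers never collide.
(define-syntax swap!
  (syntax-rules ()
    ((_ a b)
     (let ((tmp a))
       (set! a b)
       (set! b tmp)))))

(define tmp 1)
(define x 2)
(swap! tmp x)
;; With hygiene: tmp = 2, x = 1.
;; With naive expansion, the macro's `tmp` would shadow the
;; user's, and the swap would silently fail.
```

The macro expands at compile time into a let form, so the swap incurs none of the closure-creation cost mentioned above.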
Advanced Document Description, a Sequential Approach
To be able to perform efficient document processing, information systems need to use simple models of documents that can be treated in a smaller number of operations. This problem of document representation is not trivial. For decades, researchers have tried to combine relevant document representations with efficient processing. Documents are commonly represented by vectors in which each dimension corresponds to a word of the document. This approach is termed “bag of words”, as it entirely ignores the relative positions of words. One natural improvement over this representation is the extraction and use of cohesive word sequences. In this dissertation, we consider the problem of the extraction, selection and exploitation of word sequences, with a particular focus on the applicability of our work to domain-independent document collections written in any language.
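The “bag of words” model described above keeps only per-word counts and discards word order; a minimal sketch (illustrative only, not from the dissertation):

```scheme
;; Build a bag-of-words representation as an association list
;; mapping each word to its number of occurrences. Word order
;; in the input is deliberately ignored.
(define (bag-of-words words)
  (let loop ((ws words) (bag '()))
    (if (null? ws)
        bag
        (let* ((w (car ws))
               (entry (assoc w bag)))
          (loop (cdr ws)
                (if entry
                    (begin (set-cdr! entry (+ 1 (cdr entry))) bag)
                    (cons (cons w 1) bag)))))))

;; (bag-of-words '("the" "cat" "saw" "the" "dog"))
;; => (("dog" . 1) ("saw" . 1) ("cat" . 1) ("the" . 2))
```

Each word becomes one dimension of the document vector; the cohesive word sequences studied in the dissertation would instead use multi-word keys.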
ICDAR 2019 Competition on Post-OCR Text Correction
This paper describes the second round of the ICDAR 2019 competition on post-OCR text correction and presents the different methods submitted by the participants. OCR has been an active research field for over 30 years, but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The present challenge consists of two tasks: 1) error detection and 2) error correction. An original dataset of 22M OCR-ed symbols along with an aligned ground truth was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. Different sources were aggregated and contain newspapers, historical printed documents as well as manuscripts and shopping receipts, covering 10 European languages (Bulgarian, Czech, Dutch, English, Finnish, French, German, Polish, Spanish and Slovak). Five teams submitted results; the error detection scores vary from 41 to 95% and the best error correction improvement is 44%. This competition, which counted 34 registrations, illustrates the strong interest of the community in improving OCR output, a key issue for any digitization process involving textual data.
Receipt Dataset for Fraud Detection
The aim of this paper is to introduce a new dataset initially created to work on fraud detection in documents. This dataset is composed of 1969 images of receipts and the associated OCR result for each. The article details the dataset and its interest for the document analysis community. We share this dataset with the community as a benchmark for the evaluation of fraud detection approaches.
Dataset for Temporal Analysis of English-French Cognates
Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a dataset to support investigating the similarity in evolution between different languages. We look in particular into the similarities and differences between the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. For this we select a set of cognates in both languages and study their frequency changes and correlations over time. We propose a new dataset for computational approaches of synchronized diachronic investigation of language pairs, and subsequently show novel findings stemming from the cognate-focused diachronic comparison of the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language diachronic analysis.
Automatic Matching and Expansion of Abbreviated Phrases without Context
In many documents, like receipts or invoices, textual information is constrained by the space and organization of the document. The document information has no natural language context, and expressions are often abbreviated to respect the graphical layout, both at word level and phrase level. In order to analyze the semantic content of these types of document, we need to understand each phrase, and particularly each name of sold products. In this paper, we propose an approach to find the right expansion of abbreviations and acronyms, without context. First, we extract information about sold products from our receipts corpus and we analyze the different linguistic processes of abbreviation. Then, we retrieve a list of expanded names of products sold by the company that emitted the receipts, and we propose an algorithm to pair extracted names of products with the corresponding expansions. We provide the research community with a unique document collection for abbreviation expansion.
Automatic Discovery of Word Semantic Relations
In this paper, we propose an unsupervised methodology to automatically discover pairs of semantically related words by highlighting their local environment and evaluating their semantic similarity in local and global semantic spaces. This proposal differs from previous research as it tries to take the best of two different methodologies, i.e. semantic space models and information extraction models. It can be applied to extract close semantic relations, it limits the search space, and it is unsupervised.
Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents
This paper presents an overview of the second edition of HIPE (Identifying Historical People, Places and other Entities), a shared task on named entity recognition and linking in multilingual historical documents. Following the success of the first CLEF-HIPE-2020 evaluation lab, HIPE-2022 confronts systems with the challenges of dealing with more languages, learning domain-specific entities, and adapting to diverse annotation tag sets. This shared task is part of the ongoing efforts of the natural language processing and digital humanities communities to adapt and develop appropriate technologies to efficiently retrieve and explore information from historical texts. On such material, however, named entity processing techniques face the challenges of domain heterogeneity, input noisiness, dynamics of language, and lack of resources. In this context, the main objective of HIPE-2022, run as an evaluation lab of the CLEF 2022 conference, is to gain new insights into the transferability of named entity processing approaches across languages, time periods, document types, and annotation tag sets. Tasks, corpora, and results of participating teams are presented. Compared to the condensed overview [1], this paper contains more refined statistics on the datasets, a breakdown of the results per type of entity, and a discussion of the ‘challenges’ proposed in the shared task.
Find it! Fraud Detection Contest Report
This paper describes the ICPR2018 fraud detection contest, its dataset, evaluation methodology, as well as the different methods submitted by the participants to tackle the predefined tasks. Forensics research is quite a sensitive topic. Data are either private or unlabeled, and most related works are evaluated on private datasets with restricted access. This restriction has two major consequences: results cannot be reproduced and no benchmarking can be done between approaches. This contest was conceived in order to address these drawbacks. Two tasks were proposed: detecting documents containing at least one forgery in a flow of documents, and spotting and localizing these forgeries within documents. An original dataset composed of images and texts of French receipts was provided to participants. The results they obtained are presented and discussed.
"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints